Auto Insurance Claims Data

Problem Statement

  • A car insurance claim is a request by an insured for the insurance company to compensate him or her for damage sustained by the car in an accident. It may also be a request for the insurer to represent or intervene on the insured's behalf when the insured is responsible for unintentional damage caused to a third party. Since the insured pays a premium, making a claim after an accident is his or her right.

  • We are required to build a model that predicts whether a registered claim is fraudulent, using the given data set.

Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression,Ridge,Lasso
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from statsmodels.formula.api import ols
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import xgboost as xgb
from IPython.display import Image
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
import os
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
In [2]:
pd.set_option('display.max_rows', 100000)
pd.set_option('display.max_columns', 500000)
In [3]:
os.getcwd()
Out[3]:
'C:\\Users\\Subhasish Das\\Desktop\\SimplyLearn\\Project\\Kaggle\\insurance'

Reading the File

In [76]:
dt_insurance=pd.read_csv('insurance_claims.csv')
In [5]:
dt_insurance.head()
Out[5]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39
0 328 48 521585 2014-10-17 OH 250/500 1000 1406.91 0 466132 MALE MD craft-repair sleeping husband 53300 0 2015-01-25 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 9935 4th Drive 5 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 2004 Y NaN
1 228 42 342868 2006-06-27 IN 250/500 2000 1197.22 5000000 468176 MALE MD machine-op-inspct reading other-relative 0 0 2015-01-21 Vehicle Theft ? Minor Damage Police VA Riverwood 6608 MLK Hwy 8 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 2007 Y NaN
2 134 29 687698 2000-09-06 OH 100/300 2000 1413.14 5000000 430632 FEMALE PhD sales board-games own-child 35100 0 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police NY Columbus 7121 Francis Lane 7 3 NO 2 3 NO 34650 7700 3850 23100 Dodge RAM 2007 N NaN
3 256 41 227811 1990-05-25 IL 250/500 2000 1415.74 6000000 608117 FEMALE PhD armed-forces board-games unmarried 48900 -62400 2015-01-10 Single Vehicle Collision Front Collision Major Damage Police OH Arlington 6956 Maple Drive 5 1 ? 1 2 NO 63400 6340 6340 50720 Chevrolet Tahoe 2014 Y NaN
4 228 44 367455 2014-06-06 IL 500/1000 1000 1583.91 6000000 610706 MALE Associate sales board-games unmarried 66000 -46000 2015-02-17 Vehicle Theft ? Minor Damage None NY Arlington 3041 3rd Ave 20 1 NO 0 1 NO 6500 1300 650 4550 Accura RSX 2009 N NaN
In [6]:
dt_insurance.columns
Out[6]:
Index(['months_as_customer', 'age', 'policy_number', 'policy_bind_date',
       'policy_state', 'policy_csl', 'policy_deductable',
       'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex',
       'insured_education_level', 'insured_occupation', 'insured_hobbies',
       'insured_relationship', 'capital-gains', 'capital-loss',
       'incident_date', 'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'incident_location', 'incident_hour_of_the_day',
       'number_of_vehicles_involved', 'property_damage', 'bodily_injuries',
       'witnesses', 'police_report_available', 'total_claim_amount',
       'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make',
       'auto_model', 'auto_year', 'fraud_reported', '_c39'],
      dtype='object')
In [7]:
dt_insurance.shape
Out[7]:
(1000, 40)
In [8]:
corr_matrix=dt_insurance.corr()
In [9]:
corr_matrix
Out[9]:
months_as_customer age policy_number policy_deductable policy_annual_premium umbrella_limit insured_zip capital-gains capital-loss incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim auto_year _c39
months_as_customer 1.000000 0.922098 0.057555 0.026807 0.005018 0.015498 0.017895 0.006399 0.020209 0.070639 0.014736 -0.010162 0.058383 0.062108 0.065329 0.034940 0.061013 -0.000292 NaN
age 0.922098 1.000000 0.059413 0.029188 0.014404 0.018126 0.025604 -0.007075 0.007368 0.087161 0.022102 -0.015679 0.052359 0.069863 0.075522 0.060898 0.062588 0.001354 NaN
policy_number 0.057555 0.059413 1.000000 -0.006738 0.022566 0.008968 0.007083 0.009802 -0.005669 0.000113 0.013432 -0.004558 -0.012661 -0.018009 -0.008762 -0.010678 -0.020184 -0.000183 NaN
policy_deductable 0.026807 0.029188 -0.006738 1.000000 -0.003245 0.010870 0.004545 0.035212 -0.023544 0.060935 0.051214 -0.022765 0.066639 0.022839 0.039107 0.064792 0.005269 0.026105 NaN
policy_annual_premium 0.005018 0.014404 0.022566 -0.003245 1.000000 -0.006247 0.032354 -0.013738 0.023547 -0.001578 -0.045991 0.026780 0.002332 0.009094 -0.017633 -0.011654 0.020246 -0.049226 NaN
umbrella_limit 0.015498 0.018126 0.008968 0.010870 -0.006247 1.000000 0.019671 -0.047268 -0.024056 -0.023257 -0.021270 0.022743 -0.006738 -0.040344 -0.045412 -0.023790 -0.038584 0.009893 NaN
insured_zip 0.017895 0.025604 0.007083 0.004545 0.032354 0.019671 1.000000 0.006303 0.049372 0.008274 0.027448 0.028695 0.019805 -0.033873 -0.017495 -0.006841 -0.041083 -0.032736 NaN
capital-gains 0.006399 -0.007075 0.009802 0.035212 -0.013738 -0.047268 0.006303 1.000000 -0.046904 -0.016406 0.061643 0.055829 -0.017651 0.015980 0.025934 -0.000779 0.015836 0.031398 NaN
capital-loss 0.020209 0.007368 -0.005669 -0.023544 0.023547 -0.024056 0.049372 -0.046904 1.000000 -0.025054 -0.014895 -0.024418 -0.041330 -0.036060 -0.046060 -0.022863 -0.032665 -0.056615 NaN
incident_hour_of_the_day 0.070639 0.087161 0.000113 0.060935 -0.001578 -0.023257 0.008274 -0.016406 -0.025054 1.000000 0.120794 -0.034563 0.006527 0.217702 0.165768 0.179536 0.215626 0.021368 NaN
number_of_vehicles_involved 0.014736 0.022102 0.013432 0.051214 -0.045991 -0.021270 0.027448 0.061643 -0.014895 0.120794 1.000000 0.014030 -0.014669 0.274278 0.224650 0.219084 0.269393 0.034554 NaN
bodily_injuries -0.010162 -0.015679 -0.004558 -0.022765 0.026780 0.022743 0.028695 0.055829 -0.024418 -0.034563 0.014030 1.000000 -0.005606 0.047093 0.047319 0.039749 0.043449 -0.020527 NaN
witnesses 0.058383 0.052359 -0.012661 0.066639 0.002332 -0.006738 0.019805 -0.017651 -0.041330 0.006527 -0.014669 -0.005606 1.000000 -0.011114 -0.024843 0.052640 -0.022564 0.045791 NaN
total_claim_amount 0.062108 0.069863 -0.018009 0.022839 0.009094 -0.040344 -0.033873 0.015980 -0.036060 0.217702 0.274278 0.047093 -0.011114 1.000000 0.805025 0.810686 0.982773 -0.035781 NaN
injury_claim 0.065329 0.075522 -0.008762 0.039107 -0.017633 -0.045412 -0.017495 0.025934 -0.046060 0.165768 0.224650 0.047319 -0.024843 0.805025 1.000000 0.563866 0.722878 -0.013718 NaN
property_claim 0.034940 0.060898 -0.010678 0.064792 -0.011654 -0.023790 -0.006841 -0.000779 -0.022863 0.179536 0.219084 0.039749 0.052640 0.810686 0.563866 1.000000 0.732090 -0.014508 NaN
vehicle_claim 0.061013 0.062588 -0.020184 0.005269 0.020246 -0.038584 -0.041083 0.015836 -0.032665 0.215626 0.269393 0.043449 -0.022564 0.982773 0.722878 0.732090 1.000000 -0.042768 NaN
auto_year -0.000292 0.001354 -0.000183 0.026105 -0.049226 0.009893 -0.032736 0.031398 -0.056615 0.021368 0.034554 -0.020527 0.045791 -0.035781 -0.013718 -0.014508 -0.042768 1.000000 NaN
_c39 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [10]:
plt.figure(figsize=(20,15))
sns.heatmap(dt_insurance.corr(),annot=True,cmap='coolwarm')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ce48269b38>

Data Visualization

I will plot each of the independent variables for the subset of claims where fraud was reported as yes.

In [70]:
dt_insurance['collision_type'].replace({'?':'None'},inplace=True)
In [71]:
dt_insurance=dt_insurance[dt_insurance['fraud_reported']=='Y']
In [12]:
df_polstate=dt_insurance.groupby('fraud_reported')['policy_state'].value_counts(normalize=True)
df_polstate = df_polstate.mul(100).rename('Percent').reset_index()
In [13]:
df_polstate['Percent']=df_polstate['Percent'].round(decimals=2)
In [14]:
df_polstate.head()
Out[14]:
fraud_reported policy_state Percent
0 Y OH 36.84
1 Y IN 31.98
2 Y IL 31.17
In [15]:
px.bar(df_polstate, x='fraud_reported', y='Percent', color='policy_state',title="Fraud reported as yes w.r.t State" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the percentage of claims with fraud reported as yes, broken down by policy state.
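The groupby → normalize → percent pattern is repeated below for every categorical feature; it could be factored into one helper. A minimal sketch (`percent_by` and the toy frame are hypothetical stand-ins for dt_insurance):

```python
import pandas as pd

def percent_by(df, col, target='fraud_reported'):
    """Percentage share of each category of `col` within each value of `target`."""
    return (df.groupby(target)[col]
              .value_counts(normalize=True)
              .mul(100)
              .round(2)
              .rename('Percent')
              .reset_index())

# Hypothetical toy data standing in for dt_insurance:
toy = pd.DataFrame({'fraud_reported': ['Y', 'Y', 'Y', 'Y'],
                    'policy_state':   ['OH', 'OH', 'IN', 'IL']})
pct = percent_by(toy, 'policy_state')
# The result can be passed straight to px.bar(pct, x='fraud_reported',
# y='Percent', color='policy_state', barmode='group', text='Percent')
```

The `.rename('Percent')` step is what lets `.reset_index()` succeed, since the value-counts series would otherwise share a name with one of its index levels.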
In [16]:
df_sex=dt_insurance.groupby('fraud_reported')['insured_sex'].value_counts(normalize=True)
df_sex = df_sex.mul(100).rename('Percent').reset_index()
df_sex['Percent']=df_sex['Percent'].round(decimals=2)
df_sex.head(10)
Out[16]:
fraud_reported insured_sex Percent
0 Y FEMALE 51.01
1 Y MALE 48.99
In [17]:
px.bar(df_sex, x='fraud_reported', y='Percent', color='insured_sex', title="Fraud reported as yes w.r.t Insured sex" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the percentage of claims with fraud reported as yes, broken down by insured sex
  • The percentage of fraud is nearly the same for both sexes
In [18]:
df_inci=dt_insurance.groupby('fraud_reported')['incident_type'].value_counts(normalize=True)
df_inci = df_inci.mul(100).rename('Percent').reset_index()
df_inci['Percent']=df_inci['Percent'].round(decimals=2)
df_inci.head(10)
Out[18]:
fraud_reported incident_type Percent
0 Y Single Vehicle Collision 47.37
1 Y Multi-vehicle Collision 46.15
2 Y Parked Car 3.24
3 Y Vehicle Theft 3.24
In [19]:
px.bar(df_inci, x='fraud_reported', y='Percent', color='incident_type', title="Fraud reported as yes w.r.t Incident type",
                    barmode='group', text='Percent')

Observation

  • The graph clearly shows that the incident types 'Single Vehicle Collision' and 'Multi-vehicle Collision' account for a much higher percentage of reported fraud
In [72]:
df_colli=dt_insurance.groupby('fraud_reported')['collision_type'].value_counts(normalize=True)
df_colli = df_colli.mul(100).rename('Percent').reset_index()
df_colli['Percent']=df_colli['Percent'].round(decimals=2)
df_colli.head(10)
Out[72]:
fraud_reported collision_type Percent
0 Y Rear Collision 36.84
1 Y Front Collision 28.34
2 Y Side Collision 28.34
3 Y None 6.48
In [73]:
px.bar(df_colli, x='fraud_reported', y='Percent', color='collision_type', title="Fraud reported as yes w.r.t Collision type",
                    barmode='group', text='Percent')

Observation

  • The graph shows that the percentage of fraud reported as yes varies little across collision types
In [22]:
df_edu=dt_insurance.groupby('fraud_reported')['insured_education_level'].value_counts(normalize=True)
df_edu = df_edu.mul(100).rename('Percent').reset_index()
df_edu['Percent']=df_edu['Percent'].round(decimals=2)
df_edu.head(10)
Out[22]:
fraud_reported insured_education_level Percent
0 Y JD 17.00
1 Y MD 15.38
2 Y High School 14.57
3 Y Associate 13.77
4 Y PhD 13.36
5 Y College 12.96
6 Y Masters 12.96
In [23]:
px.bar(df_edu, x='fraud_reported', y='Percent', color='insured_education_level', title="Fraud reported as yes w.r.t insured education",
                    barmode='group', text='Percent')

Observation

  • The graph shows that the percentage of fraud reported as yes varies little across education levels
In [26]:
df_hobby=dt_insurance.groupby('fraud_reported')['insured_hobbies'].value_counts(normalize=True)
df_hobby = df_hobby.mul(100).rename('Percent').reset_index()
df_hobby['Percent']=df_hobby['Percent'].round(decimals=2)
df_hobby.head(10)
Out[26]:
fraud_reported insured_hobbies Percent
0 Y chess 15.38
1 Y cross-fit 10.53
2 Y reading 6.88
3 Y yachting 6.48
4 Y board-games 5.67
5 Y base-jumping 5.26
6 Y paintball 5.26
7 Y polo 5.26
8 Y hiking 4.86
9 Y exercise 4.45
In [27]:
px.bar(df_hobby, x='fraud_reported', y='Percent', color='insured_hobbies', title="Fraud reported as yes w.r.t insured hobbies",
                    barmode='group', text='Percent')

Observation

  • From the graph I can conclude that insureds with hobbies such as chess and cross-fit make fraudulent claims at a much higher rate than those with other hobbies
In [32]:
df_city=dt_insurance.groupby('fraud_reported')['incident_city'].value_counts(normalize=True)
df_city = df_city.mul(100).rename('Percent').reset_index()
df_city['Percent']=df_city['Percent'].round(decimals=2)
df_city.head(10)
Out[32]:
fraud_reported incident_city Percent
0 Y Arlington 17.81
1 Y Columbus 15.79
2 Y Springfield 15.38
3 Y Hillsdale 14.17
4 Y Northbend 13.77
5 Y Riverwood 12.15
6 Y Northbrook 10.93
In [33]:
px.bar(df_city, x='fraud_reported', y='Percent', color='incident_city', title="Fraud reported as yes w.r.t Incident City"
                    ,barmode='group', text='Percent')

Observation

  • Fraudulent activity appears to be independent of the incident city

Feature Engineering

In [223]:
dt_insurance.dtypes
Out[223]:
months_as_customer               int64
age                              int64
policy_deductable                int64
policy_annual_premium          float64
umbrella_limit                   int64
insured_education_level          int64
insured_occupation               int64
insured_hobbies                  int64
insured_relationship             int64
capital-gains                    int64
capital-loss                     int64
incident_severity               object
authorities_contacted            int64
incident_state                   int64
incident_city                    int64
incident_location                int64
incident_hour_of_the_day         int64
number_of_vehicles_involved      int64
property_damage                 object
bodily_injuries                  int64
witnesses                        int64
police_report_available         object
total_claim_amount               int64
injury_claim                     int64
property_claim                   int64
vehicle_claim                    int64
auto_make                        int64
auto_model                       int64
auto_year                        int64
fraud_reported                  object
incedent_month                   int64
incedent_year                    int64
IN                               uint8
OH                               uint8
MALE                             uint8
Parked Car                       uint8
Single Vehicle Collision         uint8
Vehicle Theft                    uint8
Front Collision                  uint8
Rear Collision                   uint8
Side Collision                   uint8
dtype: object
In [195]:
dt_insurance['incident_date']=pd.to_datetime(dt_insurance['incident_date'])
dt_insurance['policy_bind_date']=pd.to_datetime(dt_insurance['policy_bind_date'])
dt_insurance['incedent_month']=dt_insurance['incident_date'].dt.month
dt_insurance['incedent_year']=dt_insurance['incident_date'].dt.year
dt_insurance.drop(['policy_bind_date','incident_date'],axis=1,inplace=True)
In [194]:
dt_insurance=pd.read_csv('insurance_claims.csv')

I will use the one-hot encoding method to convert the categorical property_damage parameter to numeric variables.

In [363]:
dt_insurance.property_damage.unique()
Out[363]:
array(['YES', '?', 'NO'], dtype=object)
In [227]:
xyz=pd.get_dummies(dt_insurance['property_damage'],drop_first=True)
dt_insurance=pd.concat([dt_insurance,xyz],axis=1)
dt_insurance.drop('property_damage',axis=1,inplace=True)
dt_insurance.head()
Out[227]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported incedent_month incedent_year IN OH MALE Parked Car Single Vehicle Collision Vehicle Theft Front Collision Rear Collision Side Collision Minor Damage Total Loss Trivial Damage NO YES
0 328 48 1000 1406.91 0 144 74 41 170 53300 0 292 248 149 1 5 1 1 2 YES 71610 6510 13020 52080 80 28 2004 Y 1 2015 0 1 1 0 1 0 0 0 1 0 0 0 0 1
1 228 42 2000 1197.22 5000000 144 93 64 177 0 0 292 110 134 1 8 1 0 0 ? 5070 780 780 3510 65 27 2007 Y 1 2015 1 0 1 0 0 1 0 0 0 1 0 0 0 0
2 134 29 2000 1413.14 5000000 125 76 48 183 35100 0 292 262 149 1 7 3 2 3 NO 34650 7700 3850 23100 80 43 2007 N 2 2015 0 1 0 0 0 0 0 1 0 1 0 0 1 0
3 256 41 2000 1415.74 6000000 125 69 48 141 48900 -62400 292 23 152 1 5 1 1 2 NO 63400 6340 6340 50720 76 24 2014 Y 1 2015 0 0 0 0 1 0 1 0 0 0 0 0 0 0
4 228 44 1000 1583.91 6000000 145 76 48 141 66000 -46000 91 262 152 1 20 1 0 1 NO 6500 1300 650 4550 68 12 2009 N 2 2015 0 0 1 0 0 1 0 0 0 1 0 0 1 0

I will use frequency (value-count) encoding for the police_report_available variable.

In [229]:
insured_map=dt_insurance['police_report_available'].value_counts().to_dict()
insured_map
dt_insurance['police_report_available']=dt_insurance['police_report_available'].map(insured_map)
dt_insurance.head()
Out[229]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported incedent_month incedent_year IN OH MALE Parked Car Single Vehicle Collision Vehicle Theft Front Collision Rear Collision Side Collision Minor Damage Total Loss Trivial Damage NO YES
0 328 48 1000 1406.91 0 144 74 41 170 53300 0 292 248 149 1 5 1 1 2 314 71610 6510 13020 52080 80 28 2004 Y 1 2015 0 1 1 0 1 0 0 0 1 0 0 0 0 1
1 228 42 2000 1197.22 5000000 144 93 64 177 0 0 292 110 134 1 8 1 0 0 343 5070 780 780 3510 65 27 2007 Y 1 2015 1 0 1 0 0 1 0 0 0 1 0 0 0 0
2 134 29 2000 1413.14 5000000 125 76 48 183 35100 0 292 262 149 1 7 3 2 3 343 34650 7700 3850 23100 80 43 2007 N 2 2015 0 1 0 0 0 0 0 1 0 1 0 0 1 0
3 256 41 2000 1415.74 6000000 125 69 48 141 48900 -62400 292 23 152 1 5 1 1 2 343 63400 6340 6340 50720 76 24 2014 Y 1 2015 0 0 0 0 1 0 1 0 0 0 0 0 0 0
4 228 44 1000 1583.91 6000000 145 76 48 141 66000 -46000 91 262 152 1 20 1 0 1 343 6500 1300 650 4550 68 12 2009 N 2 2015 0 0 1 0 0 1 0 0 0 1 0 0 1 0

Feature Selection

In [231]:
X=dt_insurance.drop('fraud_reported',axis=1)
y=dt_insurance['fraud_reported']
In [232]:
X.head()
Out[232]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year incedent_month incedent_year IN OH MALE Parked Car Single Vehicle Collision Vehicle Theft Front Collision Rear Collision Side Collision Minor Damage Total Loss Trivial Damage NO YES
0 328 48 1000 1406.91 0 144 74 41 170 53300 0 292 248 149 1 5 1 1 2 314 71610 6510 13020 52080 80 28 2004 1 2015 0 1 1 0 1 0 0 0 1 0 0 0 0 1
1 228 42 2000 1197.22 5000000 144 93 64 177 0 0 292 110 134 1 8 1 0 0 343 5070 780 780 3510 65 27 2007 1 2015 1 0 1 0 0 1 0 0 0 1 0 0 0 0
2 134 29 2000 1413.14 5000000 125 76 48 183 35100 0 292 262 149 1 7 3 2 3 343 34650 7700 3850 23100 80 43 2007 2 2015 0 1 0 0 0 0 0 1 0 1 0 0 1 0
3 256 41 2000 1415.74 6000000 125 69 48 141 48900 -62400 292 23 152 1 5 1 1 2 343 63400 6340 6340 50720 76 24 2014 1 2015 0 0 0 0 1 0 1 0 0 0 0 0 0 0
4 228 44 1000 1583.91 6000000 145 76 48 141 66000 -46000 91 262 152 1 20 1 0 1 343 6500 1300 650 4550 68 12 2009 2 2015 0 0 1 0 0 1 0 0 0 1 0 0 1 0
In [623]:
y.head()
Out[623]:
0    Y
1    Y
2    N
3    Y
4    N
Name: fraud_reported, dtype: object
In [233]:
from sklearn.feature_selection import SelectKBest,f_classif
from sklearn.feature_selection import chi2
In [234]:
ordered_rank_features=SelectKBest(f_classif,k=20)
ordered_feature=ordered_rank_features.fit(X,dt_insurance['fraud_reported'])
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\feature_selection\univariate_selection.py:114: UserWarning:

Features [14 28] are constant.

C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\feature_selection\univariate_selection.py:115: RuntimeWarning:

invalid value encountered in true_divide

In [235]:
dfscores=pd.DataFrame(ordered_feature.scores_,columns=["Score"])
dfcolumns=pd.DataFrame(X.columns)
In [236]:
features_rank=pd.concat([dfcolumns,dfscores],axis=1)
In [237]:
features_rank.columns=['Features','Score']
features_rank
Out[237]:
Features Score
0 months_as_customer 0.421370
1 age 0.147188
2 policy_deductable 0.219163
3 policy_annual_premium 0.209284
4 umbrella_limit 3.441452
5 insured_education_level 0.270165
6 insured_occupation 0.169562
7 insured_hobbies 23.080396
8 insured_relationship 0.060092
9 capital-gains 0.366990
10 capital-loss 0.220519
11 authorities_contacted 1.124199
12 incident_state 2.061476
13 incident_city 1.063851
14 incident_location NaN
15 incident_hour_of_the_day 0.018590
16 number_of_vehicles_involved 2.689100
17 bodily_injuries 1.146656
18 witnesses 2.451026
19 police_report_available 0.770117
20 total_claim_amount 27.463776
21 injury_claim 8.328776
22 property_claim 19.327729
23 vehicle_claim 29.718214
24 auto_make 0.429121
25 auto_model 1.256620
26 auto_year 0.062735
27 incedent_month 0.831197
28 incedent_year NaN
29 IN 0.148151
30 OH 0.387155
31 MALE 0.952144
32 Parked Car 11.463255
33 Single Vehicle Collision 6.844890
34 Vehicle Theft 14.807927
35 Front Collision 1.495646
36 Rear Collision 9.334382
37 Side Collision 0.089739
38 Minor Damage 60.845234
39 Total Loss 30.151126
40 Trivial Damage 17.561740
41 NO 7.386649
42 YES 0.295389
In [364]:
features_rank.nlargest(20,'Score')
Out[364]:
Features Score
38 Minor Damage 60.845234
39 Total Loss 30.151126
23 vehicle_claim 29.718214
20 total_claim_amount 27.463776
7 insured_hobbies 23.080396
22 property_claim 19.327729
40 Trivial Damage 17.561740
34 Vehicle Theft 14.807927
32 Parked Car 11.463255
36 Rear Collision 9.334382
21 injury_claim 8.328776
41 NO 7.386649
33 Single Vehicle Collision 6.844890
4 umbrella_limit 3.441452
16 number_of_vehicles_involved 2.689100
18 witnesses 2.451026
12 incident_state 2.061476
35 Front Collision 1.495646
25 auto_model 1.256620
17 bodily_injuries 1.146656

Now I will create a DataFrame with these 20 variables and apply Logistic Regression and a Random Forest classifier.

In [248]:
dt_ins_ver=dt_insurance[["Minor Damage","Total Loss","vehicle_claim",
                                       "total_claim_amount","insured_hobbies","property_claim",
                                       "Trivial Damage","Vehicle Theft",
                                      "Parked Car","Rear Collision","injury_claim","NO",
                                      "Single Vehicle Collision",
                                      "umbrella_limit",
                                      "number_of_vehicles_involved","witnesses",
                                      "incident_state","Front Collision","auto_model","bodily_injuries","fraud_reported"]].copy()
In [249]:
dt_ins_ver.head()
Out[249]:
Minor Damage Total Loss vehicle_claim total_claim_amount insured_hobbies property_claim Trivial Damage Vehicle Theft Parked Car Rear Collision injury_claim NO Single Vehicle Collision umbrella_limit number_of_vehicles_involved witnesses incident_state Front Collision auto_model bodily_injuries fraud_reported
0 0 0 52080 71610 41 13020 0 0 0 0 6510 0 1 0 1 2 248 0 28 1 Y
1 1 0 3510 5070 64 780 0 1 0 0 780 0 0 5000000 1 0 110 0 27 0 Y
2 1 0 23100 34650 48 3850 0 0 0 1 7700 1 0 5000000 3 3 262 0 43 2 N
3 0 0 50720 63400 48 6340 0 0 0 0 6340 0 1 6000000 1 2 23 1 24 1 Y
4 1 0 4550 6500 48 650 0 1 0 0 1300 1 0 6000000 1 1 262 0 12 0 N

Before applying any algorithm, I check whether the data set is imbalanced.

In [250]:
plt.figure(figsize=(8,8))
sns.countplot('fraud_reported',data=dt_ins_ver)
plt.show()
In [251]:
dt_ins_ver.fraud_reported.value_counts()
Out[251]:
N    753
Y    247
Name: fraud_reported, dtype: int64
In [252]:
x=dt_ins_ver.drop('fraud_reported',axis=1)
y=dt_ins_ver['fraud_reported']
In [253]:
x.shape,y.shape
Out[253]:
((1000, 20), (1000,))

To handle the imbalanced data set, the SMOTETomek library (combined over- and under-sampling) is used.

In [254]:
smk = SMOTETomek(random_state=42)
X_res,y_res=smk.fit_sample(x,y)  # fit_sample was renamed fit_resample in imbalanced-learn >= 0.4

Dividing the data into train and test splits

In [255]:
x_train,x_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.3,random_state=42)
In [256]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape
Out[256]:
((917, 20), (917,), (393, 20), (393,))

Applying LogisticRegression

In [257]:
log_reg=LogisticRegression()
log_reg.fit(x_train,y_train)
log_pred=log_reg.predict(x_test)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

In [258]:
cm1=confusion_matrix(y_test,log_pred)
sns.heatmap(cm1,annot=True,fmt='d')
Out[258]:
<matplotlib.axes._subplots.AxesSubplot at 0x161263ad048>
  • In this heatmap the rows (y-axis) are the true labels from the test set and the columns (x-axis) are the model's predictions
In [259]:
print(accuracy_score(y_test,log_pred))
print(classification_report(y_test,log_pred))
0.5903307888040712
              precision    recall  f1-score   support

           N       0.60      0.53      0.56       194
           Y       0.59      0.65      0.62       199

    accuracy                           0.59       393
   macro avg       0.59      0.59      0.59       393
weighted avg       0.59      0.59      0.59       393

Applying RandomForestClassifier

In [260]:
model_rand=RandomForestClassifier()
model_rand.fit(x_train,y_train)
model_rand_test=model_rand.predict(x_test)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

In [261]:
cm1=confusion_matrix(y_test,model_rand_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[261]:
<matplotlib.axes._subplots.AxesSubplot at 0x1612777e9b0>
  • In this heatmap the rows (y-axis) are the true labels from the test set and the columns (x-axis) are the model's predictions
In [366]:
print('Accuracy Score:',accuracy_score(y_test,model_rand_test))
print(classification_report(y_test,model_rand_test))
Accuracy Score: 0.8717948717948718
              precision    recall  f1-score   support

           N       0.86      0.89      0.87       196
           Y       0.88      0.86      0.87       194

    accuracy                           0.87       390
   macro avg       0.87      0.87      0.87       390
weighted avg       0.87      0.87      0.87       390

Observation

Random Forest gives better accuracy than Logistic Regression with 20 features.
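A single train/test split can flatter one model, so a cross-validated comparison would make this claim firmer. A sketch using synthetic data as a stand-in for X_res and y_res (shapes are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the resampled claims data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)

# Mean accuracy over 5 folds for each model.
log_cv = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
rf_cv = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                        X, y, cv=5).mean()
print(f'LogisticRegression: {log_cv:.3f}  RandomForest: {rf_cv:.3f}')
```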

Now we will apply the same process using 25 variables and 15 variables.

Observation

  • Random Forest accuracy decreased when 25 variables were used for model building
  • Random Forest accuracy remained almost the same with 15 and 20 variables
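The 15- and 25-variable runs were elided above; they follow the same recipe and can be reproduced with a loop over k. A sketch using synthetic data in place of the resampled claims split (feature count and names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 43-column feature matrix.
X, y = make_classification(n_samples=1000, n_features=43, n_informative=10,
                           random_state=42)

scores = {}
for k in (15, 20, 25):
    # Keep only the k best features by ANOVA F-score, then refit the forest.
    Xk = SelectKBest(f_classif, k=k).fit_transform(X, y)
    x_tr, x_te, y_tr, y_te = train_test_split(Xk, y, test_size=0.3,
                                              random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(x_tr, y_tr)
    scores[k] = accuracy_score(y_te, clf.predict(x_te))

print(scores)  # in the notebook, 15 and 20 were close and 25 was lower
```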

Trying to improve the model by applying hyperparameter tuning

In [310]:
from sklearn.model_selection import GridSearchCV
In [311]:
model_params = {
    'n_estimators': [50, 150, 250],
    'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0],
    'min_samples_split': [2, 4, 6]
}
In [312]:
rf_model = RandomForestClassifier(random_state=1)
clf = GridSearchCV(rf_model, model_params, cv=5)
model = clf.fit(x_train,y_train)
model.best_estimator_.get_params()
{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 250,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}
In [314]:
grid_predict=model.predict(x_test)
In [315]:
cm1=confusion_matrix(y_test,grid_predict)
sns.heatmap(cm1,annot=True,fmt='d')
Out[315]:
<matplotlib.axes._subplots.AxesSubplot at 0x1612d136320>
  • In this heatmap the rows (y-axis) are the true labels from the test set and the columns (x-axis) are the model's predictions
In [316]:
print(accuracy_score(y_test,grid_predict))
print(classification_report(y_test,grid_predict))
0.8871794871794871
              precision    recall  f1-score   support

           N       0.91      0.86      0.88       196
           Y       0.86      0.92      0.89       194

    accuracy                           0.89       390
   macro avg       0.89      0.89      0.89       390
weighted avg       0.89      0.89      0.89       390

Observation

  • Random Forest accuracy increased slightly after hyperparameter tuning

Conclusion

  • After applying hyperparameter tuning, I obtained the best accuracy of about 88.7% on this data set
  • Based on the SelectKBest results, I conclude that the following variables can be used to detect a fraudulent claim with that accuracy
  • incident_severity, vehicle_claim, total_claim_amount, property_claim, incident_type, umbrella_limit, auto_model, incident_state, insured_hobbies, collision_type, bodily_injuries
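As a cross-check on this variable list, the tuned forest's feature importances could be inspected. A sketch on synthetic stand-in data (in the notebook one would use model.best_estimator_ and dt_ins_ver's columns instead):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 20-feature training matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
cols = [f'feature_{i}' for i in range(20)]  # hypothetical column names

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
# Importances are normalized to sum to 1; larger means more used in splits.
importances = (pd.Series(rf.feature_importances_, index=cols)
                 .sort_values(ascending=False))
print(importances.head(10))
```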
In [ ]: